scycle
  • Home
  • Basic pipeline
  • Integration of cell lines
    • Introduction
    • Head and neck cell line integration
    • Run scycle on integrated dataset
    • Inspect results
scycle
  • Docs »
  • Integration of cell lines

Integration of cell lines¶

Introduction¶

If you haven't checked out our basic pipeline, we'd avise starting there as it will explain the different steps in greater detail.

We'll start by loading the single-cell RNA-seq dataset from the CCLE, which contains data on 200 cell lines. This is a very large dataset, so it may take some time to download.

In [1]:
import scycle as cc
import scanpy as sc

sc200 = cc.data.get_data('sc200_CCLE')
sc200
-- Loading data from cache...
Out[1]:
AnnData object with n_obs × n_vars = 42362 × 30314
    obs: 'Cell_line', 'Pool_ID', 'Cancer_type', 'Genes_expressed', 'Discrete_cluster_minpts5_eps1.8', 'Discrete_cluster_minpts5_eps1.5', 'Discrete_cluster_minpts5_eps1.2', 'CNA_subclone', 'SkinPig_score', 'EMTI_score', 'EMTII_score', 'EMTIII_score', 'IFNResp_score', 'p53Sen_score', 'EpiSen_score', 'StressResp_score', 'ProtMatu_score', 'ProtDegra_score', 'G1/S_score', 'G2/M_score', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'

We'll first apply standard pre-processing using a scanpy recipe without pooling to visualize the whole dataset.

In [2]:
cc.pp.filter_cells(sc200)
sc.pp.recipe_zheng17(sc200, n_top_genes = 10000)
34920 cells pass the count filter
42362  cells pass the mt filter
Cells selected 34920
In [3]:
sc.pp.pca(sc200)
sc.pp.neighbors(sc200)
sc.tl.umap(sc200)
sc.pl.umap(sc200, color='Cancer_type')
/home/clarice/.local/lib/python3.8/site-packages/sklearn/manifold/_spectral_embedding.py:245: UserWarning: Graph is not fully connected, spectral embedding may not work as expected.

We can see, as was also described by the publication of these data, that cell lines are quite heterogenous by tissue of origin. We can better inspect it using an interactive plot:

In [4]:
import pandas as pd
import plotly.express as px

sc200_df = pd.DataFrame(sc200.obsm['X_umap'], index = sc200.obs.index, columns = ['x', 'y'])
sc200_df['Cancer_type'] = sc200.obs['Cancer_type'].values
sc200_df['Cell_line'] = sc200.obs['Cell_line'].values
ct_colors = sc200.uns['Cancer_type_colors']

sc200_df = sc200_df.sort_values('Cancer_type')


fig = (px.scatter(sc200_df, x='x', y='y', color ='Cancer_type', 
                  color_discrete_sequence = ct_colors, hover_data=['Cell_line']))
fig.update_traces(marker = dict(size = 2))
fig.update_layout(scene = dict(xaxis_title = 'UMAP1', yaxis_title = 'UMAP2'))
fig
In [5]:
sc200_df.groupby('Cancer_type').count()['x'].sort_values(ascending = False)
Out[5]:
Cancer_type
Lung Cancer                   8692
Head and Neck Cancer          4719
Skin Cancer                   2777
Breast Cancer                 2200
Brain Cancer                  2038
Esophageal Cancer             1662
Pancreatic Cancer             1601
Ovarian Cancer                1427
Endometrial/Uterine Cancer    1412
Colon/Colorectal Cancer       1406
Kidney Cancer                 1103
Bladder Cancer                 967
Liver Cancer                   959
Gastric Cancer                 900
Thyroid Cancer                 684
Bone Cancer                    683
Sarcoma                        608
Bile Duct Cancer               353
Neuroblastoma                  263
Prostate Cancer                238
Fibroblast                     147
Gallbladder Cancer              81
Name: x, dtype: int64
In [6]:
sc200_df.reset_index()[['Cancer_type', 'Cell_line']].drop_duplicates().groupby('Cancer_type').count()['Cell_line'].sort_values(ascending = False)
Out[6]:
Cancer_type
Lung Cancer                   40
Head and Neck Cancer          19
Skin Cancer                   16
Breast Cancer                 15
Ovarian Cancer                13
Brain Cancer                  12
Pancreatic Cancer             11
Colon/Colorectal Cancer       10
Endometrial/Uterine Cancer    10
Esophageal Cancer              7
Liver Cancer                   7
Gastric Cancer                 6
Bladder Cancer                 6
Kidney Cancer                  6
Bile Duct Cancer               4
Thyroid Cancer                 4
Sarcoma                        3
Bone Cancer                    3
Neuroblastoma                  2
Prostate Cancer                2
Gallbladder Cancer             1
Fibroblast                     1
Name: Cell_line, dtype: int64

The two cancer types with larger representation of cells and cell lines are Lung Cancer and Head and Neck cancer. We'll be subsetting the big dataset to look at cell cycle in the Head and Neck cancer cell lines.

Head and neck cell line integration¶

We'll reload and subset the data so we can process using scycle.

In [27]:
sc200 = cc.data.get_data('sc200_CCLE')
schn = sc200[sc200.obs['Cancer_type'] == 'Head and Neck Cancer']
-- Loading data from cache...

We can now look into the cell lines. we see that we have a few cell lines with a low cell count, and that SCC255 and JHU011 have the highest number of cells.

In [28]:
schn.obs.reset_index()[['NAME', 'Cell_line']].groupby('Cell_line').count().sort_values('NAME', ascending = False)['NAME']
Out[28]:
Cell_line
SCC25_UPPER_AERODIGESTIVE_TRACT         831
JHU029_UPPER_AERODIGESTIVE_TRACT        765
JHU011_UPPER_AERODIGESTIVE_TRACT        676
SCC47_UPPER_AERODIGESTIVE_TRACT         568
93VU_UPPER_AERODIGESTIVE_TRACT          525
SCC9_UPPER_AERODIGESTIVE_TRACT          450
YD38_UPPER_AERODIGESTIVE_TRACT          446
JHU006_UPPER_AERODIGESTIVE_TRACT        309
BICR6_UPPER_AERODIGESTIVE_TRACT         252
SNU46_UPPER_AERODIGESTIVE_TRACT         234
BICR31_UPPER_AERODIGESTIVE_TRACT        210
BICR16_UPPER_AERODIGESTIVE_TRACT        202
PECAPJ49_UPPER_AERODIGESTIVE_TRACT      193
SNU1214_UPPER_AERODIGESTIVE_TRACT       175
SCC90_UPPER_AERODIGESTIVE_TRACT         155
DETROIT562_UPPER_AERODIGESTIVE_TRACT    150
BHY_UPPER_AERODIGESTIVE_TRACT           131
SNU899_UPPER_AERODIGESTIVE_TRACT         82
BICR56_UPPER_AERODIGESTIVE_TRACT         77
Name: NAME, dtype: int64

We'll follow the initial standard steps of scycle pipeline:

In [29]:
cc.pp.filter_cells(schn)
cc.pp.prep_pooling(schn)
cc.tl.dimensionality_reduction(schn)
cc.pl.cell_cycle_pca(schn)
Trying to set attribute `.var` of view, copying.
5221 cells pass the count filter
6431  cells pass the mt filter
Cells selected 5221
Preparing embedding...
WARNING: genes are not in var_names and ignored: ['H4C3', 'H1-4', 'H2AC20', 'H1-2']
Embedding for pooling...
Pooling 5221 cells...
Scoring cell cycle...
WARNING: genes are not in var_names and ignored: ['H4C3', 'H1-4', 'H2AC20', 'H1-2']
-- Dimensionality reduction using ICA...
FastICA from sklearn did not converge due to numerical instabilities - Retrying...
-- Done
--- Selected components:
G1-S: 1 (score=10.807820)
G2-M: 0 (score=13.598538)
G2-M-: 5 (score=11.215384)
Histones: 8 (score=3.913921)
Out[29]:
<ggplot: (8743875821581)>

We can see in the visualization that we don't quite get a "circle" where we can fit a trajectory. We'll dissect this further:

In [30]:
cc.pl.cell_cycle_pca(schn, col_var = 'Cell_line', alpha = 0.5, size = 1)
Out[30]:
<ggplot: (8743875495849)>

It seems that cells from the same cell line are following a roughly circular (or 'triangular') trajectory, and we're seeing all their trajectories stacked. Here's an example of single cell lines:

In [31]:
cc.pl.cell_cycle_pca(schn[schn.obs['Cell_line'] == 'SCC25_UPPER_AERODIGESTIVE_TRACT'], col_var = 'Cell_line', alpha = 0.5, size = 1)
Out[31]:
<ggplot: (8743875802100)>
In [32]:
cc.pl.cell_cycle_pca(schn[schn.obs['Cell_line'] == 'JHU011_UPPER_AERODIGESTIVE_TRACT'], col_var = 'Cell_line', alpha = 0.5, size = 1)
Out[32]:
<ggplot: (8743884166172)>

As JHU011 had a cleaner trajectory, we'll use it as the reference for integration. In scycle, integration is implemented using weighted optimal transport (WOTi):

In [33]:
jhu011 = schn[schn.obs['Cell_line'] == 'JHU011_UPPER_AERODIGESTIVE_TRACT']
In [34]:
cc.tl.integration(schn, jhu011)
-- Automatically detecting cell-cycle components...
-- Integrating datasets...
> Selecting common genes...
> 10000 genes selected.
> Slicing matrices...
> Performing optimal transport based integration using WOTi...
Woti > Successfully initialized.
(Woti) ot based -- max_iter: 10000000 -- entropy regularized, hreg: 0.001000 -- unweighted
WOTi > Computing optimal transport plan...
WOTi > Projecting source dataset...
-- Done

We can inspect the results of the integration below:

In [35]:
cc.pl.cell_cycle_pca3d(schn, col_var = 'Cell_line', alpha = 0.5, size = 1.5)

Now we can proceed with the analysis of all the head and neck cell lines.

Run scycle on integrated dataset¶

Now we can simply follow the pipeline:

  1. cc.tl.trajectory finds the circular trajectory
  2. cc.tl.cell_division cuts the trajectory by finding the moment of cell division based on total_counts
  3. cc.tl.pseudotime calculates a pseudotime between 0 and 1 based on the trajectory
  4. cc.tl.cell_cycle_phase assigns a cell cycle phase to each cell based on their pseudotime and on the curvature of the trajectory in the embedded space
In [36]:
cc.tl.trajectory(schn)
WARNING: The initial number of nodes must be at least 3. This will be fixed
In [37]:
cc.pl.cell_cycle_pca(schn, trajectory = True)
Out[37]:
<ggplot: (8743884788381)>
In [38]:
cc.tl.cell_division(schn)
Suggested moment of cell division: [0 1]
Direction of cell cycle: -1
Remapping edges using [0 1] ...
In [39]:
cc.pl.cell_division(schn)
Out[39]:
<ggplot: (8743875591747)>
In [40]:
cc.tl.pseudotime(schn)
Calculating pseudotimes for each cell...
In [41]:
cc.pl.pseudotime_histogram(schn)
Out[41]:
<ggplot: (8743885627645)>
In [42]:
cc.tl.cell_cycle_phase(schn)
-- Suggested cell cycle division:
G1:  0   - 0.3
S:  0.3 - 0.5666666666666667
G2: 0.5666666666666667 - 0.8333333333333334
M:  0.8333333333333334 -   1
In [43]:
cc.pl.cell_cycle_scores(schn, alpha = 0.25)
/home/clarice/.local/lib/python3.8/site-packages/plotnine/layer.py:467: PlotnineWarning:

geom_vline : Removed 1 rows containing missing values.

Out[43]:
<ggplot: (8743884744843)>
In [44]:
cc.pl.cell_cycle_phase_pieplot(schn)
Out[44]:
([<matplotlib.patches.Wedge at 0x7f3d78d5cfa0>,
  <matplotlib.patches.Wedge at 0x7f3d78d685e0>,
  <matplotlib.patches.Wedge at 0x7f3d78d68d00>,
  <matplotlib.patches.Wedge at 0x7f3d78d72460>],
 [Text(1.0928780327673469, 0.3579351973396502, 'M'),
  Text(-0.032172942279255615, 1.1495498692032007, 'G2'),
  Text(-1.111364982396941, -0.2955805742971746, 'S'),
  Text(0.5995896756789948, -0.9813216704114701, 'G1')],
 [Text(0.5226807982800353, 0.17118639872765876, '10.1%'),
  Text(-0.015387059350948338, 0.5497847200537047, '30.7%'),
  Text(-0.5315223828854935, -0.14136462248995305, '26.6%'),
  Text(0.2867602796725627, -0.4693277554141813, '32.5%')])

Inspect results¶

Now that the pipeline is complete, the results generated using scycle can also be easily accessed by scanpy and used in relation to other variables. Here, we'll simply plot them in the UMAP projection.

In [45]:
sc.pp.pca(schn)
sc.pp.neighbors(schn)
sc.tl.umap(schn)
In [46]:
schn.uns['cell_cycle_phase_colors'] = ['#e78ac3', '#8da0cb', '#fc8d62', '#66c2a5']
sc.pl.umap(schn, color = ['Cell_line', 'G1-S', 'G2-M', 'pseudotime', 'cell_cycle_phase'], ncols = 1)
Previous

Built with MkDocs using a theme provided by Read the Docs.
« Previous